Power Tools 1993 November

Text File | 1992-04-30 | 24.4 KB | 413 lines

Multiprocessing White Paper

Introduction

Today's users of computer systems have been demanding more and more power. To accommodate this need, computer vendors have been pushing the technology limits of hardware and have turned to RISC systems for their superior performance. But some application environments require still more power than is available from single processor systems. Multiprocessor systems have thus been developed to provide a further performance boost.

The idea behind multiprocessing is simple: if one processor does not provide enough performance, add more processors. Multiprocessing (MP) systems let users reach high performance levels without waiting for next generation CPUs, while at the same time leveraging existing hardware. The challenge of MP systems is ensuring that applications are able to take advantage of the full performance available to them from additional processors.

MP systems have been available for a number of years. IBM mainframes have used multiprocessing since the late 1960s to provide a range of different performance levels. HP also implemented multiprocessing in the early 1980s with its line of UNIX workstations. Just as minicomputers are moving into mainframe levels of performance, multiprocessing is moving more and more from proprietary environments into the open environment of UNIX. Figure 1 shows how HP's rapid performance growth of PA-RISC systems has been enhanced with the extra performance levels available with MP. The availability of UNIX MP systems is an exciting development because it means that users and businesses will not just have the advantage of using open systems, but will also have the power needed for the most demanding environments.

[Figure 1: Illustration: Series 800 Performance Growth]

Definition of Multiprocessing

Multiprocessing is a design where many CPUs are connected to provide additional computing power, allowing many tasks to run in parallel. This is more than multitasking.
Multitasking systems allow many jobs or processes to be running on a system, rather than running each job serially to completion. However, that does not mean that all jobs are running at any one moment. For example, figure 2 shows how a single CPU system parcels out CPU time to multiple jobs and switches between running jobs. If the CPU switches between jobs quickly, it gives a user the appearance that many jobs are running simultaneously, even though in actuality only one job is running at a time. For multiprocessing systems, figure 2 shows that each CPU continues to switch between running jobs, but now jobs are truly running in parallel, with a job running on each CPU (of course each individual CPU continues to multitask).

[Figure 2: Illustration: Job Scheduling]

Multiprocessing also means more than having a multiuser or timeshare system. As with multitasking, a system with a single CPU can switch between jobs for different users and, if it has the performance to switch quickly enough, gives each user the appearance that they have a dedicated computer at the other end of their display. A multiprocessor system essentially provides more CPU resources, which can be shared simultaneously by many users without a drop in response time.

Another common misconception is that MP systems are Fault Tolerant systems. While Fault Tolerant systems do have multiple CPUs, as figure 3 shows for the HP 9000 Series 1200 systems, they also have fully redundant hardware for system buses, memory, I/O, peripherals and power supplies. This redundancy protects Fault Tolerant systems from unplanned downtime due to any one hardware component going down. However, MP systems can take advantage of the additional CPUs to ensure continued operation if one CPU fails.
For example, if a CPU failure in the HP 9000 Series 890S MP Corporate Business Server system causes the system to go down, the HP 890S will shut itself down and then boot up again with the faulty CPU configured out of the system. The HP 890S will continue to run at a reduced performance level until the CPU board can be repaired or replaced.

[Figure 3: Illustration: HP 9000 Series 1200 Fault Tolerant System]

Building an MP system requires more than just adding CPU boards to a system. In actuality, there is a continuum of multiprocessing systems, from loosely coupled machines on a network to multiple CPUs in one system acting as if there were only one CPU in the machine. The key measures of an MP system are how transparent the implementation is to users and programmers and how well the additional power of the extra CPUs is utilized.

Types of Multiprocessing Systems

Several terms have been suggested to describe different types of computer systems and provide a useful framework for understanding multiprocessing systems in relation to other types of systems. Computers can be grouped into three classes:

SISD (single instruction, single data stream). A SISD machine is a typical single CPU system where one CPU executes one instruction and works on one piece of data at any one time.

SIMD (single instruction, multiple data stream). One example of a SIMD machine is a vector processor, where an array of data is loaded into several registers and then the same operation (e.g. multiply by pi) is performed on all the data simultaneously. SIMD machines are mostly used for scientific and engineering applications where intense numeric computations and simulations are needed.

MIMD (multiple instruction, multiple data stream). This is where many CPUs execute different instructions on different pieces of data. However, MIMD systems differ in how instructions and data are shared between CPUs and in how much each CPU interacts with the other CPUs.
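
As a rough illustration of the three classes, the following Python sketch contrasts the shape of each style of computation. It is not how any of the machines named in this paper are programmed: Python performs the arithmetic sequentially in every case, the function names and the `lanes` parameter are invented for this example, and threads merely stand in for the independent CPUs of a MIMD machine.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def sisd_scale(values):
    """SISD shape: one instruction stream handles one data item per step."""
    out = []
    for v in values:
        out.append(v * math.pi)
    return out


def simd_scale(values, lanes=4):
    """SIMD shape: the same multiply is issued once per group of `lanes` items.

    `lanes` is a hypothetical vector width. Python still multiplies the items
    one at a time; the grouping only mimics how a vector unit consumes several
    registers with a single instruction.
    """
    out = []
    for start in range(0, len(values), lanes):
        group = values[start:start + lanes]
        out.extend(v * math.pi for v in group)
    return out


def mimd_run(tasks):
    """MIMD shape: different functions run on different data, one worker each.

    `tasks` is a list of (function, data) pairs. Results come back in the
    same order as the tasks were given.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(fn, data) for fn, data in tasks]
        return [f.result() for f in futures]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(sisd_scale(data) == simd_scale(data))   # same answer, different shape
    print(mimd_run([(sum, [1, 2, 3]), (max, [4, 9, 2])]))
```

The SISD and SIMD functions compute the same result; only the way work is grouped differs. The MIMD function is the one that corresponds to the multiprocessing systems discussed in the rest of this paper.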

Multiprocessing computer systems are MIMD machines with the CPUs in one system sharing all the system resources such as memory, I/O and buses. There are two main types of multiprocessing systems in use today: asymmetric and symmetric.

Figure 4 shows an asymmetric multiprocessing system found in several graphics workstations. Even though there are two processors in the system, they are not equal. The main processor runs the operating system and user programs, accesses the main memory, and also controls the actions of the attached graphics processor. The attached graphics processor handles the displaying and updating of the graphical display, but does not modify any of the data in main memory, nor does it execute any user programs. The attached processor does improve the performance of the overall system, because the main processor does not need to spend its resources on the graphical display, but it does not provide a user with the full power of two processors.

[Figure 4: Illustration: Graphics Workstation]

Asymmetric multiprocessing systems are relatively easy to implement because fewer modifications need to be made to the operating system, memory bus structures and I/O. However, asymmetric systems have a potential performance bottleneck in the main processor. Figure 5 shows an expanded asymmetric system with an attached vector processor and floating point processor in addition to a graphics processor. The three attached processors are controlled by the main processor and can thus sit idle if the main processor becomes busy. Asymmetric systems are best suited for special tasks such as heavy numeric calculations or simulations where the system can be tailored for the specific task at hand.

[Figure 5: Illustration: Graphics Workstation]

In symmetric multiprocessing, all processors are equal and are not specialized for specific tasks.
Figure 6 shows the HP 890S symmetric multiprocessing system, where each processor can access memory, I/O and other parts of the system. The operating system and hardware components need to maintain data consistency, job scheduling and message passing between processors. There are two main implementations of symmetric multiprocessing: master/slave and fully symmetric. While the hardware may be symmetric, the differences between these two implementations are mainly in how the operating system handles two main tasks: job scheduling and kernel access.

[Figure 6: Illustration: HP 890S Multiprocessing System]

The job scheduling algorithm determines how jobs are parceled out and scheduled to each processor and is a major factor in overall multiprocessor performance. Figure 7 shows the job scheduling for master/slave systems, where one processor (master) assigns jobs to the other processors (slaves). In fully symmetric systems, each processor receives the next job that is waiting in a global system queue.

[Figure 7: Illustration: Job Scheduling]

To show how the two scheduling routines work, think of waiting for the next available cashier at a store. In a master/slave store, one of the cashiers has the responsibility of assigning customers to the next available cashier. If the master cashier is busy when one of the slave cashiers signals that they are ready for another customer, there will be a delay before a customer is sent to the slave cashier, slowing down the overall throughput. Alternatively, the store might modify the routine slightly and have the master cashier assign each incoming customer straight away to one cashier's line. However, if for some reason there is a holdup at one of the cashiers (because of some pricing difficulties), then all the other customers in line for that cashier are stuck, even though other cashier lines continue to move quickly.
Either way, overall throughput is lowered, as the master cashier (processor) must spend time assigning and possibly reassigning customers (jobs) to the slave cashiers (processors). Also, individual customers (jobs) may find that they have to wait much longer than other customers before they are serviced by a cashier (processor), just because they were assigned to a slow line to begin with.

In contrast, in a store that implements a fully symmetric system, no one cashier assigns customers to the other cashiers. All incoming customers line up in one queue and, when they get to the head of the line, they go immediately to the next available cashier. The advantage is that no extra time or resources are spent allocating customers (jobs) to specific cashiers (processors) and that any queued customer (job) is more likely to be serviced in a timely fashion.

Kernel access, or the lack of access, can also hinder multiprocessor performance. As figure 8 shows, some implementations of master/slave systems limit the execution of the UNIX kernel (the operating system core) and access to I/O to only the master processor, further tightening the master processor bottleneck. Examples of master/slave multiprocessing can be seen in systems from SUN and Solbourne. These types of systems are best suited to environments that have little I/O activity or have intensive numerical tasks that can be easily scheduled. In a heavy multiuser environment, users of master/slave symmetric multiprocessing systems (while gaining some improvement) will fail to see the extra performance promised from additional processors, as the extra processors will spend more and more time waiting for kernel or I/O requests to be filled by the master processor.

[Figure 8: Illustration: Kernel and I/O Access]

Fully symmetric multiprocessing systems, such as the HP 890S, are more difficult to implement, but are better suited to heavy multiuser or I/O intensive environments, such as OLTP and RDBMS applications.
Each processor is equal to the other processors and so can access I/O or the UNIX kernel directly, instead of sending all such requests to a master processor and then waiting for service. In general, symmetric multiprocessing (SMP) systems provide better scaling for commercial environments, such as Online Transaction Processing (OLTP) applications. Mixed workloads would also work best on an SMP system, because they are most likely to have a high number of I/O and kernel calls.

Application Fit with Multiprocessing

Applications that appear to the UNIX system as just one large process would most likely not run faster on a multiprocessing system. Such applications are called single threaded and typically cannot be split up among multiple processors. Large batch jobs, common in commercial environments, are one example of single threaded applications. In order to be split up over multiple processors, a single threaded application needs to be either explicitly "broken up" into multiple threads or processes, or implicitly broken up with the help of special parallel compilers. However, general purpose parallel compilers are still not available for the commercial marketplace. For today, large batch applications would run best on SMP systems where each individual processor has high performance, rather than on an SMP system using many lower performance processors.

The balance between batch jobs and mixed applications (such as OLTP) is one of the reasons that the HP 890S multiprocessing system was implemented with high performance processors, rather than with a larger number of lower performance processors. The HP 890S offers the fastest combined batch and OLTP system performance due to its fast individual processors.

RDBMS applications can also be tuned explicitly for MP systems for increased OLTP performance. HP has worked, and continues to work, closely with key industry leading RDBMS solution providers to optimize their applications specifically for the HP 890S MP system.
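
The trade-off between a few fast processors and many slower ones can be made concrete with a small calculation. This paper does not give a formula; the sketch below uses Amdahl's law, a standard model brought in here for illustration, in which only the parallel fraction of a workload benefits from extra processors. The function names, per-processor speeds and parallel fractions are all assumptions chosen for the example, not measurements of the HP 890S or of any other system.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup over one processor when only part of the work can be split up."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def relative_throughput(per_cpu_speed, parallel_fraction, processors):
    """Throughput relative to a single processor of speed 1.0."""
    return per_cpu_speed * amdahl_speedup(parallel_fraction, processors)


if __name__ == "__main__":
    # Hypothetical configurations: 4 processors of speed 2.0
    # versus 8 processors of speed 1.0 (same total raw capacity).
    workloads = [("single threaded batch", 0.0), ("mostly parallel OLTP", 0.95)]
    for label, fraction in workloads:
        fast = relative_throughput(2.0, fraction, 4)
        slow = relative_throughput(1.0, fraction, 8)
        print(f"{label}: 4 fast = {fast:.2f}, 8 slow = {slow:.2f}")
```

Under these assumed numbers the four fast processors come out ahead in both cases: 2.00 versus 1.00 for the single threaded batch job, and about 6.96 versus 5.93 for the mostly parallel workload. The gap is widest for the batch job, which cannot use more than one processor at all.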

Multiprocessing Challenges

A fully symmetric multiprocessing system needs to overcome several hurdles without compromising the performance potential of additional processors. The main challenges are ensuring data integrity, job scheduling, I/O and kernel access, and application transparency. The fully symmetric multiprocessing implementation of the HP 890S will be used as an example of how these multiprocessing design challenges have been met.

Ensuring Data Integrity

As seen in figure 6, the HP 890S is a tightly coupled system with all processors sharing the same memory. It is very likely that different jobs running on different processors will need to access the same memory locations, and thus the system must ensure that data integrity is maintained at all times with little or no contention.

The HP 890S solves this problem in hardware with a "snoopy" cache protocol. All processors have their own local cache and memory controllers, and manipulate data in memory through their caches. Processors can either share cache lines or mark lines as private. Each memory controller listens to every transaction on the System Memory Bus (SMB) (it "snoops," hence the term snoopy) and acts as needed to maintain data integrity. For example, if processors A and B share a cache line and processor A needs to write to the shared line, A broadcasts the write and marks the cache line as private. Processor B picks up the write request and marks its copy of the cache line as invalid, leaving the modified line on processor A as the only valid copy.

The HP 890S also avoids another potential bottleneck due to the extra traffic on the SMB. The SMB has a bandwidth of 800 MB/s, ensuring that bus contention will not occur between the processors and thus will not decrease performance.

Job Scheduling

A fully symmetric MP system has already been shown in figure 7 to use a single job queue for all processors.
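
The difference between a single job queue and the cashier analogy's pre-assigned lines can be sketched as a small simulation. This is a simplified model written for this illustration, not the HP-UX scheduler: all jobs are assumed to arrive at time 0, service times are made-up integers, the function names are hypothetical, and the time a master processor spends assigning jobs is ignored.

```python
import heapq


def single_queue(service_times, processors):
    """One shared queue: each job, in order, takes the first processor to free up.

    Returns (list of wait times per job, time at which all work is finished).
    """
    free_at = [0] * processors          # min-heap of times each processor becomes idle
    heapq.heapify(free_at)
    waits = []
    for service in service_times:
        start = heapq.heappop(free_at)  # earliest available processor
        waits.append(start)
        heapq.heappush(free_at, start + service)
    return waits, max(free_at)


def assigned_lines(service_times, processors):
    """Per-processor lines: jobs are dealt out round-robin as they arrive.

    Returns (list of wait times per job, time at which all work is finished).
    """
    free_at = [0] * processors
    waits = []
    for index, service in enumerate(service_times):
        line = index % processors       # the line is fixed at arrival time
        waits.append(free_at[line])
        free_at[line] += service
    return waits, max(free_at)


if __name__ == "__main__":
    jobs = [10, 1, 1, 1, 1, 1]          # the first job is the "pricing difficulty"
    print(single_queue(jobs, 2))
    print(assigned_lines(jobs, 2))
```

With these sample jobs on two processors, the single queue gives waits of 0, 0, 1, 2, 3 and 4 and finishes at time 10, while the pre-assigned lines give waits of 0, 0, 10, 1, 11 and 2 and finish at time 12. The jobs stuck behind the slow one are the customers in the analogy who were assigned to a slow line.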
The next waiting job is sent to the next available processor. While a single job queue is an efficient MP scheduling method in general because of its inherent dynamic load balancing, it can be made more efficient by recognizing that not all jobs finish after one run on a processor. Often jobs or processes are scheduled to run for some length of time and are then swapped out and placed back in the job queue while another job is swapped in. However, a process builds up local data structures on the processor that it runs on (such as data in its cache). If that process is later scheduled to run on another processor, then the second processor must spend time rebuilding the same data structures that are resident on the first processor. The HP 890S removes this potential source of inefficiency by using a heuristic scheduling algorithm where each process develops an affinity for one particular processor, but can move to another processor if needed for load balancing.

To understand this heuristic scheduling method, imagine that a process is handled the same way as a person who constantly checks in and out of a hotel. If the hotel kept checking this person into the same room, then some of the person's baggage could remain in the room while the person was out (assuming of course that everyone else has the courtesy to leave it alone), making it very easy and quick for this person (and the hotel) to check back in. Only if it was necessary to move to another room, or if this person was not coming back, would the effort be spent to pack up all this person's belongings and move them.

Kernel Access

In order to maximize performance, the HP 890S allows multiple processes to execute in the UNIX kernel at the same time. Without this capability, it is very likely that processors would sit idle while they wait for access to the kernel.
The HP-UX kernel data structures are broken up into several pieces, with synchronization variables called semaphores used to protect each block of data structures. A processor must lock the appropriate semaphore before it can access the desired kernel section. If another processor has already locked the semaphore, then the requesting processor must wait until the semaphore is unlocked. Kernel contention is reduced by having each semaphore lock only a small part of the kernel. The kernel semaphores also ensure that no I/O collisions occur between different processors. This is in contrast to master/slave systems, where the UNIX kernel runs only on the master processor and thus there is no need to add semaphores to the kernel. This is one reason why master/slave systems are easier to implement.

Application Transparency

Multiprocessor systems should ideally be transparent to applications, meaning that an application can run on single processor or multiprocessor systems without any modification. For most applications, this will be the case for the HP 890S. The HP-UX system call interfaces for the multiprocessor systems are the same as for a single processor system. Applications that are structured as a set of cooperating processes may experience some problems in a multiprocessor system because processes may not execute in the same order as in a single processor system. These applications should be tested to identify any potential timing problems.

Future Multiprocessing Trends

Many system vendors will continue to refine their multiprocessing offerings. Several vendors such as HP will incorporate advanced technology for improved multiprocessor performance.

HP will continue to enhance its multiprocessing offering. Presently HP offers a 4-way multiprocessor system with the HP 890S Corporate Business Server. The HP 890S will be extended to an 8-way and later a 16-way multiprocessing system.
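
The per-section locking described under Kernel Access can be sketched with ordinary threads. This is a toy model for illustration only, not HP-UX kernel code: the class, the section names `proc_table` and `file_table`, and the idea of counting updates are all invented here, and Python threads stand in for processors.

```python
import threading


class ToyKernel:
    """Hypothetical kernel tables, each guarded by its own binary semaphore."""

    def __init__(self, sections):
        self.data = {name: 0 for name in sections}
        self.sems = {name: threading.Semaphore(1) for name in sections}

    def update(self, section):
        # Lock only the section being touched. A processor working on a
        # different section is not blocked by this semaphore.
        with self.sems[section]:
            current = self.data[section]
            self.data[section] = current + 1


def run_processors(kernel, plan, repeats=1000):
    """Run one thread per simulated processor.

    `plan` maps a processor number to the kernel section it updates.
    Returns a copy of the kernel data after all threads have finished.
    """
    def worker(section):
        for _ in range(repeats):
            kernel.update(section)

    threads = [threading.Thread(target=worker, args=(section,))
               for section in plan.values()]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return dict(kernel.data)


if __name__ == "__main__":
    kernel = ToyKernel(["proc_table", "file_table"])
    plan = {0: "proc_table", 1: "file_table", 2: "proc_table", 3: "file_table"}
    print(run_processors(kernel, plan))
```

Processors 0 and 2 contend only with each other, as do processors 1 and 3. Splitting the data into more sections, each with its own semaphore, reduces how often any two processors wait on the same one.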

In order to support these higher levels of multiprocessing, HP will break up the HP-UX kernel into smaller pieces to allow for increased simultaneous access by many more CPUs. Each kernel piece and data structure will continue to be protected with semaphores.

HP-UX will also support multiple threads in the kernel. Threads are special light-weight UNIX processes. A traditional UNIX process (called a task) cannot be broken down into smaller units. If a UNIX system needs to start a new task, it does so with some overhead. With threads, new work can be started as a thread with less overhead (by doing things such as sharing the same address space and common memory), leading to increased performance. Also, tasks can be broken down into several related threads, leading to a finer level of granularity. This finer granularity will lead to improved multiprocessor performance because of the parallelism inherent in threads. HP-UX will also implement a communication mechanism called ports as a way for threads to know about each other, to talk to one another and to synchronize related threads. Support for threads will be added first for user applications and then for the HP-UX kernel. With these enhancements, the HP 890S will continue to offer significant performance increases over the next few years.

Massively Parallel Systems

Much attention has been given to research on massively parallel multiprocessor systems. Figure 9 shows that while the hardware implementation is straightforward and the potential performance gains quite large, the software needed to split up processes, coordinate them and maintain data consistency is not straightforward. SMP systems like the HP 890S are extensions of single processor systems with a common bus to easily share and maintain consistency of a common memory.
An SMP software environment is very similar, if not identical, to a single processor software environment, with the benefit of existing applications running without modification. In contrast, applications will still have to be written specifically for massively parallel systems. Applications such as complex numeric simulations (e.g. airflow analysis for the space shuttle) that can easily be broken down into individual and well defined subprocesses will continue to be the best fit for massively parallel systems. For commercial applications, special parallel compilers and development tools that would automatically break up applications while maintaining data consistency still need to be developed. While progress is being made in this area, general purpose parallel compilers are still estimated to be years away.

[Figure 9: Illustration: Massively Parallel System]

Distributed Computing

In many respects, there are similarities between massively parallel multiprocessing systems and computer environments made up of multiple computers connected by a high-speed network, often called a multicomputer. Figure 10 shows that a multicomputer consists of distributed systems (with their own private memories) connected together, executing processes in parallel, and using messages to synchronize and maintain data consistency.

In order to provide a standards-based framework for multicomputers (or distributed computing), the Open Software Foundation (OSF) has developed the Distributed Computing Environment (DCE) specification. DCE provides a common framework for sending process requests over a network and also a distributed file system that can be shared by many computers. With DCE, a large application can be written as several parts that can then be parceled out to several computers on the network using Remote Procedure Calls (RPC). Each computer works in parallel on its portion, with the result that the overall application executes much faster than it would on one individual system.
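
The pattern of splitting an application into parts, sending each part to a separate computer as a request, and combining the replies can be sketched as follows. This is not the DCE RPC interface: threads and in-process queues stand in for the networked computers and the messages between them, and the function names and the sum-of-squares workload are invented for this example.

```python
import queue
import threading


def node(inbox, outbox):
    """Stand-in for one networked computer.

    It keeps no shared state with the coordinator and communicates only by
    receiving request messages and sending reply messages.
    """
    while True:
        message = inbox.get()
        if message is None:             # shutdown message
            break
        request_id, numbers = message
        outbox.put((request_id, sum(x * x for x in numbers)))


def distributed_sum_of_squares(numbers, node_count=3):
    """Split the input across `node_count` nodes and combine their replies."""
    inboxes = [queue.Queue() for _ in range(node_count)]
    replies = queue.Queue()
    workers = [threading.Thread(target=node, args=(inbox, replies))
               for inbox in inboxes]
    for worker in workers:
        worker.start()

    # Send one slice of the problem to each node as a request message.
    for request_id, inbox in enumerate(inboxes):
        inbox.put((request_id, numbers[request_id::node_count]))

    # Collect exactly one reply per request, in whatever order they arrive.
    partials = dict(replies.get() for _ in range(node_count))

    for inbox in inboxes:
        inbox.put(None)
    for worker in workers:
        worker.join()
    return sum(partials.values())


if __name__ == "__main__":
    print(distributed_sum_of_squares(list(range(10))))
```

Each request carries an identifier so that replies can be matched to requests regardless of arrival order, which is one of the bookkeeping tasks a real RPC framework takes care of.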

Multicomputer systems have the advantage of using general purpose computers, which are typically very cost effective and can continue to execute non-distributed applications as well as distributed ones. Of course, an SMP system could be one of the systems in a multicomputer environment.

[Figure 10: Illustration: Distributed Computing]

Summary

Multiprocessing systems provide the extra levels of performance needed for demanding environments. While there are several types of multiprocessing implementations, fully symmetric multiprocessing systems offer the most balanced performance for commercial and mixed workload environments. SMP systems most fully utilize the extra performance available with additional processors. Systems such as the HP 890S provide SMP transparently to existing applications that were originally written for single processor systems. Standards development groups such as OSF will provide operating system technologies that will improve multiprocessing performance. While massively parallel systems will continue to be advanced, SMP systems with multicomputer networks will remain cost effective solutions for meeting general purpose computing needs.

Associated files: mpwp01.gal, mpwp01.plt, mpwp02.gal, mpwp02.plt, mpwp03.gal, mpwp03.plt, mpwp04.gal, mpwp04.plt, mpwp05.gal, mpwp05.plt, mpwp06.gal, mpwp06.plt, mpwp07.gal, mpwp07.plt, mpwp08.gal, mpwp08.plt, mpwp09.gal, mpwp09.plt, mpwp10.gal, mpwp10.plt